Here we use the dataset published by Anna Cuomo et al. (last author Oliver Stegle), which we will refer to as the iPSC dataset. The paper describing this dataset can be found using this link.
In the experiment, the authors harvested induced pluripotent stem cells (iPSCs) from 125 healthy human donors. These cells were used to study endoderm differentiation, the process in which iPSCs differentiate into endoderm cells over approximately three days. The authors therefore cultured the iPSC lines and allowed them to differentiate for three days. During the experiment, cells were harvested at four time points: day0 (directly at incubation), day1, day2 and day3. Given what is known about endoderm differentiation, these time points should correspond to different cell types: day0 cells are (undifferentiated) iPSCs, day1 cells are mesendoderm cells, day2 cells are “intermediate” cells and day3 cells are fully differentiated endoderm cells.
This dataset was generated using the SMART-Seq2 scRNA-seq protocol.
The final goal of the experiment was to characterize population variation in the process of endoderm differentiation.
For this lab session, we will work with a subset of the data, i.e., the data for the first (alphabetically) 15 patients in the experiment. These are the data you already downloaded for lab session 2 using the belnet filesender link.
The original data (125 patients) can be downloaded from Zenodo. At the bottom of that web page, we can download the files raw_counts.csv.zip and cell_metadata_cols.tsv and store them locally. We do not recommend doing this during the lab session, to avoid overloading the system.
First we read in the data, which are stored as a SingleCellExperiment object:
library(SingleCellExperiment)
sce <- readRDS("/Users/jg/Desktop/sce_15_cuomo.rds")
Exploration of the metadata is essential to get a better idea of what the experiment was about and how it was organized.
colData(sce)[1:5,1:10]
## DataFrame with 5 rows and 10 columns
## assigned auxDir cell_filter cell_name
## <integer> <character> <logical> <character>
## 21554_5#104 1 aux_info TRUE 21554_5#104
## 21554_5#110 1 aux_info TRUE 21554_5#110
## 21554_5#113 1 aux_info TRUE 21554_5#113
## 21554_5#117 1 aux_info TRUE 21554_5#117
## 21554_5#127 1 aux_info TRUE 21554_5#127
## compatible_fragment_ratio day donor expected_format
## <numeric> <character> <character> <character>
## 21554_5#104 0.999981 day2 dixh IU
## 21554_5#110 0.999964 day2 dixh IU
## 21554_5#113 0.999945 day2 dixh IU
## 21554_5#117 0.999916 day2 dixh IU
## 21554_5#127 0.999863 day2 dixh IU
## experiment frag_dist_length
## <character> <integer>
## 21554_5#104 expt_21 1001
## 21554_5#110 expt_21 1001
## 21554_5#113 expt_21 1001
## 21554_5#117 expt_21 1001
## 21554_5#127 expt_21 1001
colnames(colData(sce))
## [1] "assigned"
## [2] "auxDir"
## [3] "cell_filter"
## [4] "cell_name"
## [5] "compatible_fragment_ratio"
## [6] "day"
## [7] "donor"
## [8] "expected_format"
## [9] "experiment"
## [10] "frag_dist_length"
## [11] "gc_bias_correct"
## [12] "is_cell_control"
## [13] "is_cell_control_bulk"
## [14] "is_cell_control_control"
## [15] "library_types"
## [16] "libType"
## [17] "log10_total_counts"
## [18] "log10_total_counts_endogenous"
## [19] "log10_total_counts_ERCC"
## [20] "log10_total_counts_feature_control"
## [21] "log10_total_counts_MT"
## [22] "log10_total_features"
## [23] "log10_total_features_endogenous"
## [24] "log10_total_features_ERCC"
## [25] "log10_total_features_feature_control"
## [26] "log10_total_features_MT"
## [27] "mapping_type"
## [28] "mates1"
## [29] "mates2"
## [30] "n_alt_reads"
## [31] "n_total_reads"
## [32] "num_assigned_fragments"
## [33] "num_bias_bins"
## [34] "num_bootstraps"
## [35] "num_compatible_fragments"
## [36] "num_consistent_mappings"
## [37] "num_inconsistent_mappings"
## [38] "num_libraries"
## [39] "num_mapped"
## [40] "num_processed"
## [41] "num_targets"
## [42] "nvars_used"
## [43] "pct_counts_endogenous"
## [44] "pct_counts_ERCC"
## [45] "pct_counts_feature_control"
## [46] "pct_counts_MT"
## [47] "pct_counts_top_100_features"
## [48] "pct_counts_top_100_features_endogenous"
## [49] "pct_counts_top_100_features_feature_control"
## [50] "pct_counts_top_200_features"
## [51] "pct_counts_top_200_features_endogenous"
## [52] "pct_counts_top_50_features"
## [53] "pct_counts_top_50_features_endogenous"
## [54] "pct_counts_top_50_features_ERCC"
## [55] "pct_counts_top_50_features_feature_control"
## [56] "pct_counts_top_500_features"
## [57] "pct_counts_top_500_features_endogenous"
## [58] "percent_mapped"
## [59] "plate_id"
## [60] "plate_well_id"
## [61] "post_prob"
## [62] "public_name"
## [63] "read_files"
## [64] "salmon_version"
## [65] "samp_type"
## [66] "sample_id"
## [67] "seq_bias_correct"
## [68] "size_factor"
## [69] "start_time"
## [70] "strand_mapping_bias"
## [71] "total_counts"
## [72] "total_counts_endogenous"
## [73] "total_counts_ERCC"
## [74] "total_counts_feature_control"
## [75] "total_counts_MT"
## [76] "total_features"
## [77] "total_features_endogenous"
## [78] "total_features_ERCC"
## [79] "total_features_feature_control"
## [80] "total_features_MT"
## [81] "used_in_expt"
## [82] "well_id"
## [83] "well_type"
## [84] "donor_short_id"
## [85] "donor_long_id"
## [86] "pseudo"
## [87] "PC1_top100hvgs"
## [88] "PC1_top200hvgs"
## [89] "PC1_top500hvgs"
## [90] "PC1_top1000hvgs"
## [91] "PC1_top2000hvgs"
## [92] "princ_curve"
## [93] "princ_curve_scaled01"
As stated in the paper, cells were sampled on 4 time points. Each of these time points is expected to correspond with different cell types (day0 = iPSC, day1 = mesendoderm, day2 = intermediate and day3 = endoderm).
table(colData(sce)$day)
##
## day0 day1 day2 day3
## 876 987 1124 890
As stated in the paper, cells were harvested from 125 patients. Here, we are working on a subset with 15 patients. The number of cells harvested per patient (over all time points) ranges from 31 to 637.
length(table(colData(sce)$donor)) # number of donors
## [1] 15
range(table(colData(sce)$donor)) # cells per donor
## [1] 31 637
Below, we look at how many cells were harvested per patient and per time point.
table(colData(sce)$donor,colData(sce)$day)
##
## day0 day1 day2 day3
## aowh 88 100 93 95
## aoxv 68 58 96 71
## babz 28 0 41 0
## bezi 13 11 4 3
## bima 0 0 44 31
## bokz 159 200 164 114
## cicb 42 21 75 26
## ciwj 40 27 35 39
## cuhk 41 47 39 27
## datg 185 147 136 115
## dixh 0 46 73 84
## eesb 66 106 103 195
## eipl 99 189 198 57
## eiwy 25 18 10 25
## eoxi 22 17 13 8
We see that for most patients (12 of 15) the data are complete, i.e. cells were sampled at all four time points. The exceptions are babz, bima and dixh, which each lack cells for at least one time point.
Practically, the cells were prepared in 28 batches. Since we only look at a subset of the data here, only 14 of these batches are represented.
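As a minimal sketch of how such a completeness check could be automated, the toy matrix below copies four rows of the donor-by-day table above. The object names `cells` and `incomplete` are our own and are not part of the original analysis.

```r
# Toy donor-by-day cell counts, copied from four rows of the table above
cells <- matrix(c(88, 100, 93, 95,
                  28,   0, 41,  0,
                   0,   0, 44, 31,
                   0,  46, 73, 84),
                nrow = 4, byrow = TRUE,
                dimnames = list(donor = c("aowh", "babz", "bima", "dixh"),
                                day = paste0("day", 0:3)))

# A donor is incomplete if at least one time point has zero cells
incomplete <- rownames(cells)[rowSums(cells == 0) > 0]
print(incomplete)
```

On the real object, the same idea could be applied to table(colData(sce)$donor, colData(sce)$day).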
length(table(colData(sce)$experiment))
## [1] 14
table(colData(sce)$experiment, colData(sce)$day)
##
## day0 day1 day2 day3
## expt_21 0 46 73 84
## expt_22 22 17 13 8
## expt_24 28 0 41 0
## expt_29 73 91 93 86
## expt_30 15 9 0 9
## expt_31 83 68 114 53
## expt_33 70 49 53 64
## expt_34 274 298 247 165
## expt_36 25 18 10 25
## expt_39 13 11 4 3
## expt_41 99 189 198 57
## expt_42 0 0 44 31
## expt_43 134 164 199 266
## expt_45 40 27 35 39
The rowData slot of a SingleCellExperiment object allows for storing information on the features, i.e. the genes, in a dataset. In our object, the rowData slot currently contains the following:
head(rowData(sce))
## DataFrame with 6 rows and 1 column
## V1
## <character>
## 1 ENSG00000000003_TSPAN6
## 2 ENSG00000000419_DPM1
## 3 ENSG00000000457_SCYL3
## 4 ENSG00000000460_C1or..
## 5 ENSG00000001036_FUCA2
## 6 ENSG00000001084_GCLC
To improve our gene-level information, we may:
Split V1 into two columns, one with the ENSEMBL ID and the other with the gene symbol.
Add the chromosome on which each gene is located.
Many more options are possible, but are not necessary for us right now.
rowData(sce) <- data.frame(Ensembl = gsub("_.*", "", rowData(sce)$V1),
Symbol = gsub("^[^_]*_", "", rowData(sce)$V1))
head(rowData(sce))
## DataFrame with 6 rows and 2 columns
## Ensembl Symbol
## <character> <character>
## 1 ENSG00000000003 TSPAN6
## 2 ENSG00000000419 DPM1
## 3 ENSG00000000457 SCYL3
## 4 ENSG00000000460 C1orf112
## 5 ENSG00000001036 FUCA2
## 6 ENSG00000001084 GCLC
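To see what the two regular expressions used above do, here is a small self-contained sketch. The first two identifiers come from the output above; the third one is hypothetical and only serves to show what would happen if a gene symbol itself contained an underscore.

```r
# Two identifiers from the output above, plus one made-up identifier
# whose symbol contains an underscore (hypothetical, for illustration only)
ids <- c("ENSG00000000003_TSPAN6",
         "ENSG00000000460_C1orf112",
         "ENSG00000000001_FAKE_GENE")

# "_.*" matches from the first underscore to the end of the string
ensembl <- gsub("_.*", "", ids)

# "^[^_]*_" matches everything up to and including the first underscore
symbol <- gsub("^[^_]*_", "", ids)

print(data.frame(Ensembl = ensembl, Symbol = symbol))
```

Because both patterns key on the first underscore, the two pieces always add up to the original string, even when the symbol contains further underscores.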
# currently issues with ensembl server -> do not evaluate this chunk
library("biomaRt")
ensembl75 <- useEnsembl(biomart = 'genes',
dataset = 'hsapiens_gene_ensembl',
version = 75)
GeneInfo <- getBM(attributes = c("ensembl_gene_id", # to match with the Ensembl IDs in rowData(sce)
                                 "chromosome_name"), # chromosome on which the gene is located
                  mart = ensembl75)
GeneInfo <- GeneInfo[match(rowData(sce)$Ensembl, GeneInfo$ensembl_gene_id),]
rowData(sce) <- cbind(rowData(sce), GeneInfo)
head(rowData(sce))
## DataFrame with 6 rows and 4 columns
## Ensembl Symbol ensembl_gene_id chromosome_name
## <character> <character> <character> <character>
## 1 ENSG00000000003 TSPAN6 ENSG00000000003 X
## 2 ENSG00000000419 DPM1 ENSG00000000419 20
## 3 ENSG00000000457 SCYL3 ENSG00000000457 1
## 4 ENSG00000000460 C1orf112 ENSG00000000460 1
## 5 ENSG00000001036 FUCA2 ENSG00000001036 6
## 6 ENSG00000001084 GCLC ENSG00000001084 6
all(rowData(sce)$Ensembl == rowData(sce)$ensembl_gene_id)
## [1] TRUE
# identical, as desired, so we could optionally remove one of the two
Let us first try the very simple and very lenient filtering criterion that we adopted for the Macosko dataset.
keep <- rowSums(assays(sce)$counts > 0) > 10
table(keep)
## keep
## TRUE
## 11231
We see that this filtering strategy does not remove any genes for this dataset. In general, data from plate-based scRNA-seq protocols have a far higher sequencing depth than data from droplet-based protocols. Requiring at least 1 count in more than 10 cells is a very lenient criterion if we consider that our subset contains 3,877 cells (the full dataset contains about 36,000), so we should consider adopting a more stringent filtering criterion, like filterByExpr from edgeR:
library(edgeR)
table(colData(sce)$day)
##
## day0 day1 day2 day3
## 876 987 1124 890
keep2 <- edgeR::filterByExpr(y=sce,
group = colData(sce)$day,
min.count = 5,
min.prop = 0.4)
table(keep2)
## keep2
## FALSE TRUE
## 857 10374
sce <- sce[keep2,]
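To make the lenient detection-based criterion used above more concrete, the following sketch applies the same rule to a toy count matrix. All counts and gene names are made up for illustration.

```r
# Toy count matrix: 4 genes (rows) x 12 cells (columns), values made up
counts <- rbind(
  geneA = rep(5, 12),              # 5 counts in all 12 cells
  geneB = c(rep(3, 10), 0, 0),     # detected in exactly 10 cells
  geneC = c(rep(1, 11), 0),        # a single count in 11 cells
  geneD = c(500, 500, rep(0, 10))  # highly expressed, but only in 2 cells
)

# Same rule as above: keep genes detected (count > 0) in more than 10 cells
n_detected <- rowSums(counts > 0)
keep <- n_detected > 10
print(data.frame(n_detected, keep))
```

Note that geneC passes with only a single count per cell, while geneB fails because the inequality is strict. This illustrates why the rule removes very little in deeply sequenced data.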
library(scater)
## Loading required package: scuttle
## Loading required package: ggplot2
##
## Attaching package: 'scater'
## The following object is masked from 'package:limma':
##
## plotMDS
# check ERCC spike-in transcripts
sum(grepl("^ERCC-", rowData(sce)$Symbol)) # no spike-in transcripts available
## [1] 0
is.mito <- grepl("^MT", rowData(sce)$chromosome_name)
sum(is.mito) # 13 mitochondrial genes
## [1] 13
df <- perCellQCMetrics(sce, subsets=list(Mito=is.mito))
head(df)
## DataFrame with 6 rows and 6 columns
## sum detected subsets_Mito_sum subsets_Mito_detected
## <numeric> <numeric> <numeric> <numeric>
## 21554_5#104 138676.3 5305 77.5935 7
## 21554_5#110 685123.5 5927 402.2876 8
## 21554_5#113 1671911.4 5613 1010.8276 9
## 21554_5#117 90419.4 6066 51.1047 6
## 21554_5#127 59463.2 6549 28.5289 6
## 21554_5#128 416482.7 7870 153.9212 7
## subsets_Mito_percent total
## <numeric> <numeric>
## 21554_5#104 0.0559530 138676.3
## 21554_5#110 0.0587175 685123.5
## 21554_5#113 0.0604594 1671911.4
## 21554_5#117 0.0565196 90419.4
## 21554_5#127 0.0479774 59463.2
## 21554_5#128 0.0369574 416482.7
## add the QC variables to sce object
colData(sce) <- cbind(colData(sce), df)
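The three QC metrics used in the remainder of this section (`sum`, `detected` and `subsets_Mito_percent`) can be reproduced by hand. The sketch below does so in base R on a toy matrix; gene names, counts and the `is_mito` vector are made up, and this is only meant to show what the columns mean, not how perCellQCMetrics is implemented.

```r
# Toy count matrix: 4 genes (rows) x 3 cells (columns), values made up
counts <- matrix(c(10, 0, 5,
                    0, 0, 2,
                    3, 1, 0,
                    7, 1, 3),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(c("g1", "g2", "mito1", "mito2"),
                                 c("cell1", "cell2", "cell3")))
is_mito <- c(FALSE, FALSE, TRUE, TRUE)

lib_size <- colSums(counts)        # corresponds to "sum"
detected <- colSums(counts > 0)    # corresponds to "detected"
# percentage of each cell's counts that comes from mitochondrial genes
mito_pct <- 100 * colSums(counts[is_mito, , drop = FALSE]) / lib_size

print(data.frame(sum = lib_size, detected = detected, subsets_Mito_percent = mito_pct))
```

Here cell2 has a tiny library, few detected genes and only mitochondrial counts, which is the typical profile of a damaged cell.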
In the figures below, we see that several cells have a very low number of expressed genes, combined with a comparatively high percentage of molecules derived from mitochondrial genes. This indicates likely damaged cells, presumably because of loss of cytoplasmic RNA from perforated cells, so we should remove these for the downstream analysis.
# Number of genes vs library size
plotColData(sce, x = "sum", y="detected", colour_by="day")
# Mitochondrial genes
plotColData(sce, x = "detected", y="subsets_Mito_percent", colour_by="day")
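Below, we flag cells that are outlying with respect to library size (too low), number of detected genes (too low) or percentage of mitochondrial counts (too high).
With the default settings we flag a total of \(269\) cells, mainly due to low library size and, to a lesser extent, a low number of detected genes.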
lowLib <- isOutlier(df$sum, type="lower", log=TRUE)
lowFeatures <- isOutlier(df$detected, type="lower", log=TRUE)
highMito <- isOutlier(df$subsets_Mito_percent, type="higher")
table(lowLib)
## lowLib
## FALSE TRUE
## 3676 201
table(lowFeatures)
## lowFeatures
## FALSE TRUE
## 3813 64
table(highMito)
## highMito
## FALSE TRUE
## 3852 25
discardCells <- (lowLib | lowFeatures | highMito)
table(discardCells)
## discardCells
## FALSE TRUE
## 3608 269
colData(sce)$discardCells <- discardCells
# visualize cells to be removed
plotColData(sce, x = "detected", y="subsets_Mito_percent", colour_by = "discardCells")
plotColData(sce, x = "sum", y="detected", colour_by="discardCells")
# same figures, now coloured by donor
plotColData(sce, x = "detected", y="subsets_Mito_percent", colour_by = "donor")
plotColData(sce, x = "sum", y="detected", colour_by="donor")
# same figures, now coloured by experiment (batch)
plotColData(sce, x = "detected", y="subsets_Mito_percent", colour_by = "experiment")
plotColData(sce, x = "sum", y="detected", colour_by="experiment")
table(sce$donor, sce$discardCells)
##
## FALSE TRUE
## aowh 367 9
## aoxv 284 9
## babz 44 25
## bezi 30 1
## bima 73 2
## bokz 624 13
## cicb 152 12
## ciwj 135 6
## cuhk 135 19
## datg 566 17
## dixh 90 113
## eesb 452 18
## eipl 537 6
## eiwy 77 1
## eoxi 42 18
table(sce$donor, sce$discardCells)/rowSums(table(sce$donor, sce$discardCells))
##
## FALSE TRUE
## aowh 0.97606383 0.02393617
## aoxv 0.96928328 0.03071672
## babz 0.63768116 0.36231884
## bezi 0.96774194 0.03225806
## bima 0.97333333 0.02666667
## bokz 0.97959184 0.02040816
## cicb 0.92682927 0.07317073
## ciwj 0.95744681 0.04255319
## cuhk 0.87662338 0.12337662
## datg 0.97084048 0.02915952
## dixh 0.44334975 0.55665025
## eesb 0.96170213 0.03829787
## eipl 0.98895028 0.01104972
## eiwy 0.98717949 0.01282051
## eoxi 0.70000000 0.30000000
#fractions of removed cells per donor
The highest fractions of removed cells are found for patients dixh (56%), babz (36%) and eoxi (30%).
table(sce$experiment, sce$discardCells)
##
## FALSE TRUE
## expt_21 90 113
## expt_22 42 18
## expt_24 44 25
## expt_29 336 7
## expt_30 31 2
## expt_31 287 31
## expt_33 227 9
## expt_34 963 21
## expt_36 77 1
## expt_39 30 1
## expt_41 537 6
## expt_42 73 2
## expt_43 736 27
## expt_45 135 6
table(sce$experiment, sce$donor)
##
## aowh aoxv babz bezi bima bokz cicb ciwj cuhk datg dixh eesb eipl eiwy
## expt_21 0 0 0 0 0 0 0 0 0 0 203 0 0 0
## expt_22 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## expt_24 0 0 69 0 0 0 0 0 0 0 0 0 0 0
## expt_29 343 0 0 0 0 0 0 0 0 0 0 0 0 0
## expt_30 33 0 0 0 0 0 0 0 0 0 0 0 0 0
## expt_31 0 0 0 0 0 0 164 0 154 0 0 0 0 0
## expt_33 0 0 0 0 0 0 0 0 0 236 0 0 0 0
## expt_34 0 0 0 0 0 637 0 0 0 347 0 0 0 0
## expt_36 0 0 0 0 0 0 0 0 0 0 0 0 0 78
## expt_39 0 0 0 31 0 0 0 0 0 0 0 0 0 0
## expt_41 0 0 0 0 0 0 0 0 0 0 0 0 543 0
## expt_42 0 0 0 0 75 0 0 0 0 0 0 0 0 0
## expt_43 0 293 0 0 0 0 0 0 0 0 0 470 0 0
## expt_45 0 0 0 0 0 0 0 141 0 0 0 0 0 0
##
## eoxi
## expt_21 0
## expt_22 60
## expt_24 0
## expt_29 0
## expt_30 0
## expt_31 0
## expt_33 0
## expt_34 0
## expt_36 0
## expt_39 0
## expt_41 0
## expt_42 0
## expt_43 0
## expt_45 0
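The donor-by-experiment table above shows that each experiment contains only one or two donors. A small sketch of how this nesting could be summarized is given below; the six "cells" are made up, with donor and experiment pairs taken from the table above.

```r
# Six made-up cells; donor/experiment pairs taken from the table above
donor      <- c("aowh", "aowh", "datg", "datg", "bokz", "dixh")
experiment <- c("expt_29", "expt_30", "expt_33", "expt_34", "expt_34", "expt_21")

tab <- table(donor, experiment)

# Number of distinct donors per experiment, and of experiments per donor
donors_per_experiment <- colSums(tab > 0)
experiments_per_donor <- rowSums(tab > 0)
print(donors_per_experiment)
print(experiments_per_donor)
```

When most experiments contain a single donor, donor and batch effects cannot be separated from each other in those experiments.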
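Each experiment contains only one or two donors, so donor and batch are largely confounded. The removed cells come mostly from patients dixh (expt_21) and babz (expt_24). Most low library sizes seem to come from patient dixh; for patient babz the effect is less pronounced.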
plotColData(sce[,sce$donor=="dixh"], x = "sum", y="detected")
plotColData(sce[,sce$donor=="babz"], x = "sum", y="detected")
As such, we are mainly removing cells from specific patients and the respective batches in which they were sequenced. However, we want to be careful; we only want to remove technical artefacts, while retaining as much of the biology as possible. In our exploratory figure, we see that the cells we are removing based on the number of genes detected are quite far apart from the bulk of the data cloud; as such, these cells are indeed suspicious. For the library size criterion, we see that the removed cells are still strongly connected to the data cloud, so we may want to relax the filtering criterion there a little. Given how the adaptive threshold strategy works, we can do this by removing cells that are more than 4 MADs away from the median, rather than the default 3 MADs.
# previously
lowLib <- isOutlier(df$sum, type="lower", log=TRUE)
table(lowLib)
## lowLib
## FALSE TRUE
## 3676 201
# after seeing appropriate exploratory figure
lowLib <- isOutlier(df$sum, nmads=4, type="lower", log=TRUE)
table(lowLib)
## lowLib
## FALSE TRUE
## 3783 94
discardCells <- (lowLib | lowFeatures | highMito)
table(discardCells)
## discardCells
## FALSE TRUE
## 3706 171
colData(sce)$discardCells <- discardCells
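The adaptive threshold idea used above can be sketched in a few lines of base R: a value is flagged when it lies more than `nmads` median absolute deviations (MADs) below the median on the log scale. The helper `is_low_outlier` and the library sizes are made up for illustration; this is a simplified stand-in, not the actual isOutlier implementation. The choice of log base does not affect which cells are flagged, since the median and the MAD are rescaled by the same factor.

```r
# Flag values more than `nmads` MADs below the median, on the log2 scale
is_low_outlier <- function(x, nmads = 3) {
  lx <- log2(x)
  lx < median(lx) - nmads * mad(lx)
}

# Made-up library sizes for eight cells
lib_size <- c(100000, 120000, 90000, 110000, 95000, 105000, 58000, 2000)

flag3 <- is_low_outlier(lib_size, nmads = 3)
flag4 <- is_low_outlier(lib_size, nmads = 4)
print(data.frame(lib_size, flag3, flag4))
```

In this toy example the cell with 58,000 counts is flagged at 3 MADs but retained at 4 MADs, whereas the cell with 2,000 counts is flagged in both cases.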
Note that these steps are not exact; different analysts will come up with different filtering criteria for many of the steps. The key idea is that we let appropriate exploratory figures guide us to make reasonable choices; i.e., we look at the data rather than blindly following a standardized pipeline that may work well in many cases, but maybe not for our particular dataset.
# remove cells identified using adaptive thresholds
sce <- sce[, !colData(sce)$discardCells]
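For normalization, the size factors \(s_i\) computed here are simply scaled library sizes, where \(Y_{gi}\) is the count for gene \(g\) in cell \(i\) and \(n\) is the number of cells:
\[ N_i = \sum_g Y_{gi} \] \[ s_i = N_i / \bar{N}, \qquad \bar{N} = \frac{1}{n} \sum_{i=1}^{n} N_i \]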
sce <- logNormCounts(sce)
# note we also returned log counts: see the additional logcounts assay.
sce
## class: SingleCellExperiment
## dim: 10374 3706
## metadata(0):
## assays(2): counts logcounts
## rownames: NULL
## rowData names(4): Ensembl Symbol ensembl_gene_id chromosome_name
## colnames(3706): 21554_5#128 21554_5#142 ... 24947_6#91 24947_6#98
## colData names(101): assigned auxDir ... discardCells sizeFactor
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
# you can extract size factors using
sf <- librarySizeFactors(sce)
mean(sf) # equal to 1 due to scaling.
## [1] 1
plot(x= log(colSums(assays(sce)$counts)),
y=sf)
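The library size normalization above can be reproduced by hand on a toy example. The sketch below uses a made-up count matrix and assumes the default settings of logNormCounts (log2 transformation with a pseudo-count of 1).

```r
# Toy count matrix: 3 genes (rows) x 3 cells (columns), values made up
counts <- matrix(c(10, 20, 30,
                    0,  5, 10,
                   10, 15, 50),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(paste0("gene", 1:3), paste0("cell", 1:3)))

lib_size    <- colSums(counts)             # N_i
size_factor <- lib_size / mean(lib_size)   # s_i = N_i / mean(N)

# Divide each cell (column) by its size factor, then log2-transform
logcounts <- log2(sweep(counts, 2, size_factor, "/") + 1)

print(size_factor)
print(mean(size_factor))   # 1 by construction
```

Because the size factors are library sizes divided by their mean, they average to 1, which keeps the normalized values on roughly the same scale as the original counts.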
— end lab session 1 —
library(scran)
rownames(sce) <- rowData(sce)$Ensembl
dec <- modelGeneVar(sce)
head(dec)
## DataFrame with 6 rows and 6 columns
## mean total tech bio p.value FDR
## <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
## ENSG00000000003 5.45476 0.863737 1.28367 -0.4199327 0.773284 0.885077
## ENSG00000000419 5.83407 1.029855 1.07569 -0.0458302 0.538891 0.885077
## ENSG00000000457 0.76369 1.179007 1.73788 -0.5588690 0.769433 0.885077
## ENSG00000000460 3.11235 1.544099 2.62979 -1.0856866 0.827958 0.890125
## ENSG00000001036 3.57638 2.179776 2.45008 -0.2703043 0.599802 0.885077
## ENSG00000001084 1.70225 2.384638 2.56580 -0.1811642 0.564274 0.885077
fit <- metadata(dec)
plot(fit$mean, fit$var,
xlab="Mean of log-expression",
ylab="Variance of log-expression")
curve(fit$trend(x), col="dodgerblue", add=TRUE, lwd=2)
# get top 1000 highly variable genes
hvg <- getTopHVGs(dec,
n=1000)
head(hvg)
## [1] "ENSG00000147869" "ENSG00000158815" "ENSG00000095596" "ENSG00000104371"
## [5] "ENSG00000185155" "ENSG00000120937"
# plot these
plot(fit$mean, fit$var,
col = c("orange", "darkseagreen3")[(names(fit$mean) %in% hvg)+1],
xlab="Mean of log-expression",
ylab="Variance of log-expression")
curve(fit$trend(x), col="dodgerblue", add=TRUE, lwd=2)
legend("topleft",
legend = c("Selected", "Not selected"),
col = c("darkseagreen3", "orange"),
pch = 16,
bty='n')
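The logic behind the columns `total`, `tech` and `bio` shown above can be illustrated with simulated data: fit a smooth trend of the variance as a function of the mean, treat the fitted value as the technical component, and rank genes by the remaining variance. In the sketch below, `loess` is used as a simple stand-in for the trend fitting in scran, and all data are simulated, so this is only an approximation of what modelGeneVar and getTopHVGs do.

```r
set.seed(42)
n_genes <- 200
n_cells <- 100

# Simulated log-expression: each gene has its own mean and unit variance
mu <- runif(n_genes, min = 0.5, max = 8)
logexpr <- matrix(rnorm(n_genes * n_cells, mean = rep(mu, times = n_cells), sd = 1),
                  nrow = n_genes)

# Give the first 10 genes extra ("biological") variability
logexpr[1:10, ] <- logexpr[1:10, ] + rnorm(10 * n_cells, sd = 3)

gene_mean <- rowMeans(logexpr)
gene_var  <- apply(logexpr, 1, var)     # "total" variance per gene

# Smooth mean-variance trend, used here as the "technical" component
trend <- loess(gene_var ~ gene_mean)
tech  <- predict(trend)                 # fitted trend at each gene's mean
bio   <- gene_var - tech                # excess ("biological") variance

# The 10 genes with the largest excess variance
top <- order(bio, decreasing = TRUE)[1:10]
print(sort(top))
```

In this simulation the selected genes are the ones that were given extra variability, while genes that merely follow the trend get a `bio` value close to zero (or negative, as in the output above).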
set.seed(1234)
sce <- runPCA(sce,
ncomponents=30,
subset_row=hvg)
plotPCA(sce,
colour_by = "day")
PCA has been performed. The PCA information has been automatically stored in the reducedDim slot of the SingleCellExperiment object.
reducedDimNames(sce)
## [1] "PCA"
head(reducedDim(sce,
type="PCA"))
## PC1 PC2 PC3 PC4 PC5 PC6
## 21554_5#128 -27.328077 9.763073 -9.584141 32.27431 16.318113 26.19347
## 21554_5#142 -26.937387 8.439599 -6.991705 34.12408 -5.307289 22.88428
## 21554_5#174 -16.446209 16.527976 -7.808878 30.80647 20.986857 22.08394
## 21554_5#176 -4.001995 15.540162 -21.952635 28.50917 32.577862 17.57834
## 21554_5#181 -22.177901 7.610681 -7.919849 37.49862 13.233910 24.88041
## 21554_5#183 -16.008230 15.307207 14.797829 33.50458 -1.367002 30.79008
## PC7 PC8 PC9 PC10 PC11 PC12
## 21554_5#128 10.530287 -1.0154662 -1.6800997 2.377893 6.771425 2.5234087
## 21554_5#142 -8.181347 -9.2145720 15.9249976 -11.010488 1.356081 4.3907040
## 21554_5#174 -9.511922 1.4888081 6.0293625 1.819305 -18.203208 -8.5940590
## 21554_5#176 1.351036 1.7864469 8.9918127 -3.981641 -21.903486 -4.5175264
## 21554_5#181 1.636555 0.4692383 -4.4855861 2.750001 6.352438 -0.1504927
## 21554_5#183 -6.627750 -6.8580220 0.7820617 -3.899674 2.633024 2.2719994
## PC13 PC14 PC15 PC16 PC17 PC18
## 21554_5#128 -3.4168747 0.06145093 1.3050678 -0.7893209 1.760482 0.5484686
## 21554_5#142 -9.5771564 -8.42697321 -0.6693316 2.9661022 -3.043214 -1.2396564
## 21554_5#174 -5.7212686 -2.08958546 4.3690819 -1.0496434 5.238671 0.4381355
## 21554_5#176 0.4918523 -15.12470483 -4.9644227 2.2072936 4.942645 -0.8018322
## 21554_5#181 -5.1349512 8.07271937 3.4761499 -9.4085614 4.276378 1.1399724
## 21554_5#183 -2.8795225 1.59825375 4.8692262 -2.6722691 6.107270 -2.4544138
## PC19 PC20 PC21 PC22 PC23 PC24
## 21554_5#128 -5.58283701 -6.708292 6.27060018 3.609027 -2.2596501 0.9670209
## 21554_5#142 0.39254873 9.606196 1.41932347 -3.709020 12.4757530 -0.5652936
## 21554_5#174 -0.01610044 2.707064 -2.48392860 -2.329172 5.1986465 6.7652448
## 21554_5#176 5.59664140 -3.664225 -2.19573555 1.819274 4.3650418 6.1996788
## 21554_5#181 -3.05563321 -3.538269 0.54857034 -2.184846 -0.8662808 -0.1549631
## 21554_5#183 -2.35016714 -2.545329 -0.02554832 1.688935 3.5987351 -1.9657998
## PC25 PC26 PC27 PC28 PC29 PC30
## 21554_5#128 -0.02582576 -0.7677610 3.38271405 -4.4738900 -5.539883 -4.6565710
## 21554_5#142 -2.62344588 -0.8815258 -4.44224867 -1.5357031 4.672024 -3.8431734
## 21554_5#174 -4.78566943 7.1501209 -0.03786239 -4.6940387 -7.248648 -1.7295850
## 21554_5#176 -2.96075502 5.7176329 -0.13407015 -2.8377563 -2.777841 -7.2698016
## 21554_5#181 2.45070735 -0.8947824 0.34532035 -4.1242166 1.894310 1.2173858
## 21554_5#183 -2.44972909 3.1770976 -0.29954469 -0.8578297 -2.925045 0.4022497
The plotPCA function of the scater package now allows us to visualize the cells in PCA space, based on the PCA information stored in our object:
plotPCA(sce,
colour_by = "day")
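For intuition on what is stored in the PCA slot, here is a base R sketch on a simulated matrix. It uses `prcomp` with centring but without scaling, which we assume mirrors the default behaviour of runPCA; the numerical routine used by scater may differ, so results on real data can differ in sign or in small numerical details.

```r
set.seed(1)
# Toy log-expression matrix: 20 genes (rows) x 15 cells (columns)
logexpr <- matrix(rnorm(20 * 15), nrow = 20,
                  dimnames = list(paste0("gene", 1:20), paste0("cell", 1:15)))

# prcomp expects observations (cells) in rows, hence the transpose;
# genes are centred but not scaled, and only 3 components are kept
pca <- prcomp(t(logexpr), center = TRUE, scale. = FALSE, rank. = 3)

print(dim(pca$x))       # cells x components
print(head(pca$x, 3))   # coordinates of the first three cells
```

As in reducedDim(sce, type="PCA"), each row corresponds to a cell and each column to a principal component, ordered by decreasing variance.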
library(glmpca)
set.seed(211103)
poipca <- glmpca(Y = assays(sce)$counts[hvg,],
L = 2,
fam = "poi",
minibatch = "stochastic")
reducedDim(sce, "PoiPCA") <- poipca$factors
plotReducedDim(sce,
dimred="PoiPCA",
colour_by = "day")
set.seed(8778)
sce <- runTSNE(sce,
dimred = 'PCA',
external_neighbors=TRUE)
plotTSNE(sce,
colour_by = "day")
set.seed(65187)
sce <- runPCA(sce,
ncomponents=30,
subset_row=hvg)
sce <- runUMAP(sce,
dimred = 'PCA',
pca = 12,
external_neighbors = TRUE)
plotUMAP(sce,
colour_by = "day")
plotUMAP(sce,
colour_by = "donor")
plotUMAP(sce,
colour_by = "experiment")
— end lab session 2 —
table(sce$donor,sce$experiment)
##
## expt_21 expt_22 expt_24 expt_29 expt_30 expt_31 expt_33 expt_34 expt_36
## aowh 0 0 0 342 32 0 0 0 0
## aoxv 0 0 0 0 0 0 0 0 0
## babz 0 0 54 0 0 0 0 0 0
## bezi 0 0 0 0 0 0 0 0 0
## bima 0 0 0 0 0 0 0 0 0
## bokz 0 0 0 0 0 0 0 634 0
## cicb 0 0 0 0 0 155 0 0 0
## ciwj 0 0 0 0 0 0 0 0 0
## cuhk 0 0 0 0 0 140 0 0 0
## datg 0 0 0 0 0 0 235 343 0
## dixh 109 0 0 0 0 0 0 0 0
## eesb 0 0 0 0 0 0 0 0 0
## eipl 0 0 0 0 0 0 0 0 0
## eiwy 0 0 0 0 0 0 0 0 78
## eoxi 0 47 0 0 0 0 0 0 0
##
## expt_39 expt_41 expt_42 expt_43 expt_45
## aowh 0 0 0 0 0
## aoxv 0 0 0 292 0
## babz 0 0 0 0 0
## bezi 30 0 0 0 0
## bima 0 0 73 0 0
## bokz 0 0 0 0 0
## cicb 0 0 0 0 0
## ciwj 0 0 0 0 137
## cuhk 0 0 0 0 0
## datg 0 0 0 0 0
## dixh 0 0 0 0 0
## eesb 0 0 0 466 0
## eipl 0 539 0 0 0
## eiwy 0 0 0 0 0
## eoxi 0 0 0 0 0
# target effect in PCA space, all time points
plotPCA(sce,
colour_by = "day")
# donor (nuisance) effect in PCA space, all time points
plotPCA(sce,
colour_by = "donor")
# experiment (nuisance) effect in PCA space, all time points
plotPCA(sce,
colour_by = "experiment")
# donor effect in PCA space, per time point
plotPCA(sce[,sce$day=="day0"],
colour_by = "donor")
plotPCA(sce[,sce$day=="day1"],
colour_by = "donor")
plotPCA(sce[,sce$day=="day2"],
colour_by = "donor")
plotPCA(sce[,sce$day=="day3"],
colour_by = "donor")
# nuisance effects in t-SNE space, all time points
plotTSNE(sce,
colour_by = "donor")
plotTSNE(sce,
colour_by = "experiment")
#saveRDS(sce, "/Users/jg/Desktop/sce_after_prep.rds")
sce <- readRDS("/Users/jg/Desktop/sce_after_prep.rds")
library(Seurat)
## Attaching SeuratObject
##
## Attaching package: 'Seurat'
## The following object is masked from 'package:SummarizedExperiment':
##
## Assays
seurat_obj <- as.Seurat(sce)
## Warning: Keys should be one or more alphanumeric characters followed by an
## underscore, setting key from PC to PC_
## Warning: All keys should be one or more alphanumeric characters followed by an
## underscore '_', setting key to PC_
## Warning: Keys should be one or more alphanumeric characters followed by an
## underscore, setting key from dim to dim_
## Warning: All keys should be one or more alphanumeric characters followed by an
## underscore '_', setting key to dim_
seurat_obj # notice the "0 variable features"
## An object of class Seurat
## 10374 features across 3706 samples within 1 assay
## Active assay: originalexp (10374 features, 0 variable features)
## 3 dimensional reductions calculated: PCA, PoiPCA, TSNE
table(seurat_obj$donor)
##
## aowh aoxv babz bezi bima bokz cicb ciwj cuhk datg dixh eesb eipl eiwy eoxi
## 374 292 54 30 73 634 155 137 140 578 109 466 539 78 47
table(seurat_obj$donor)[table(seurat_obj$donor) <= 30]
## bezi
## 30
# drop donors with 30 cells or fewer; %in% also works if several donors qualify
small_donors <- names(which(table(seurat_obj$donor) <= 30))
seurat_obj <- seurat_obj[, which(!(seurat_obj$donor %in% small_donors))]
seurat_obj.list <- SplitObject(seurat_obj, split.by = "donor")
nlevels(as.factor(sce$donor)) # originally 15 patients
## [1] 15
length(seurat_obj.list) # 14 patients left
## [1] 14
# normalize and identify variable features for each dataset (patient) independently
seurat_obj.list <- lapply(X = seurat_obj.list, FUN = function(x) {
x <- NormalizeData(x,verbose = FALSE)
x <- FindVariableFeatures(x,
selection.method = "vst",
nfeatures = 1000,
verbose = FALSE)
})
# select features that are repeatedly variable across datasets for integration
features <- SelectIntegrationFeatures(object.list = seurat_obj.list)
anchors <- FindIntegrationAnchors(object.list = seurat_obj.list,
anchor.features = features,
verbose = FALSE)
## Warning in irlba(A = mat3, nv = num.cc): You're computing too large a percentage
## of total singular values, use a standard svd instead.
## Warning in FilterAnchors(object = object.pair, assay = assay, slot = slot, :
## Number of anchor cells is less than k.filter. Retaining all anchors.
## Warning in irlba(A = mat3, nv = num.cc): You're computing too large a percentage
## of total singular values, use a standard svd instead.
## Warning in FilterAnchors(object = object.pair, assay = assay, slot = slot, :
## Number of anchor cells is less than k.filter. Retaining all anchors.
## Warning in irlba(A = mat3, nv = num.cc): You're computing too large a percentage
## of total singular values, use a standard svd instead.
## Warning in FilterAnchors(object = object.pair, assay = assay, slot = slot, :
## Number of anchor cells is less than k.filter. Retaining all anchors.
## Warning in FilterAnchors(object = object.pair, assay = assay, slot = slot, :
## Number of anchor cells is less than k.filter. Retaining all anchors.
## Warning in irlba(A = mat3, nv = num.cc): You're computing too large a percentage
## of total singular values, use a standard svd instead.
## Warning in FilterAnchors(object = object.pair, assay = assay, slot = slot, :
## Number of anchor cells is less than k.filter. Retaining all anchors.
## Warning in irlba(A = mat3, nv = num.cc): You're computing too large a percentage
## of total singular values, use a standard svd instead.
## Warning in FilterAnchors(object = object.pair, assay = assay, slot = slot, :
## Number of anchor cells is less than k.filter. Retaining all anchors.
## Warning in FilterAnchors(object = object.pair, assay = assay, slot = slot, :
## Number of anchor cells is less than k.filter. Retaining all anchors.
## Warning in irlba(A = mat3, nv = num.cc): You're computing too large a percentage
## of total singular values, use a standard svd instead.
## Warning in FilterAnchors(object = object.pair, assay = assay, slot = slot, :
## Number of anchor cells is less than k.filter. Retaining all anchors.
## Warning in irlba(A = mat3, nv = num.cc): You're computing too large a percentage
## of total singular values, use a standard svd instead.
## Warning in FilterAnchors(object = object.pair, assay = assay, slot = slot, :
## Number of anchor cells is less than k.filter. Retaining all anchors.
## Warning in FilterAnchors(object = object.pair, assay = assay, slot = slot, :
## Number of anchor cells is less than k.filter. Retaining all anchors.
## Warning in FilterAnchors(object = object.pair, assay = assay, slot = slot, :
## Number of anchor cells is less than k.filter. Retaining all anchors.
## Warning in irlba(A = mat3, nv = num.cc): You're computing too large a percentage
## of total singular values, use a standard svd instead.
## Warning in FilterAnchors(object = object.pair, assay = assay, slot = slot, :
## Number of anchor cells is less than k.filter. Retaining all anchors.
## Warning in irlba(A = mat3, nv = num.cc): You're computing too large a percentage
## of total singular values, use a standard svd instead.
## Warning in FilterAnchors(object = object.pair, assay = assay, slot = slot, :
## Number of anchor cells is less than k.filter. Retaining all anchors.
## Warning in FilterAnchors(object = object.pair, assay = assay, slot = slot, :
## Number of anchor cells is less than k.filter. Retaining all anchors.
## Warning in FilterAnchors(object = object.pair, assay = assay, slot = slot, :
## Number of anchor cells is less than k.filter. Retaining all anchors.
## Warning in FilterAnchors(object = object.pair, assay = assay, slot = slot, :
## Number of anchor cells is less than k.filter. Retaining all anchors.
## Warning in irlba(A = mat3, nv = num.cc): You're computing too large a percentage
## of total singular values, use a standard svd instead.
# this command creates an 'integrated' data assay
data.combined <- IntegrateData(anchorset = anchors,
k.weight = 30,
verbose=FALSE)
# Run the standard Seurat workflow for visualization and clustering
data.combined <- ScaleData(object = data.combined,
verbose = FALSE)
data.combined <- RunPCA(object = data.combined,
npcs = 30,
reduction.name = "PCA_SeuBatch",
verbose = FALSE)
# data.combined <- RunUMAP(object = data.combined,
# reduction = "pca",
# dims = 1:12,
# min.dist=0.4,
# n.neighbors=15,
# verbose = FALSE)
data.combined <- RunTSNE(object = data.combined,
reduction = "PCA_SeuBatch",
reduction.name = "tSNE_SeuBatch",
dims = 1:12)
data.combined <- FindNeighbors(object = data.combined,
reduction = "PCA_SeuBatch",
dims = 1:12,
verbose = FALSE)
data.combined <- FindClusters(object = data.combined,
resolution = 0.5,
verbose = FALSE)
# t-SNE visualization
p1 <- DimPlot(object = data.combined,
reduction = "tSNE_SeuBatch",
group.by = "donor")
p2 <- DimPlot(object = data.combined,
reduction = "tSNE_SeuBatch",
group.by = "day")
p1 + p2
Visualize using Bioconductor functions
sce_intSeurat <- as.SingleCellExperiment(data.combined)
# without Seurat batch correction
p1 <- plotTSNE(sce,
colour_by = "day")
p2 <- plotTSNE(sce,
colour_by = "donor")
p1 + p2
# with Seurat batch correction
# install harmony from github
library(devtools)
## Loading required package: usethis
## Registered S3 method overwritten by 'cli':
## method from
## print.boxx spatstat.geom
install_github("immunogenomics/harmony",
dependencies = TRUE,
force = TRUE)
## Downloading GitHub repo immunogenomics/harmony@HEAD
## Skipping 1 packages not available: SingleCellExperiment
## ✓ checking for file ‘/private/var/folders/hg/dfv6rqms0y11rd__cr82qlyw0000gn/T/RtmpJIXkQN/remotesd30c722b2426/immunogenomics-harmony-c93de54/DESCRIPTION’
## ─ preparing ‘harmony’:
## ✓ checking DESCRIPTION meta-information
## ─ cleaning src
## ─ checking for LF line-endings in source and make files and shell scripts
## ─ checking for empty or unneeded directories
## ─ building ‘harmony_0.1.0.tar.gz’
library(harmony)
## Loading required package: Rcpp
set.seed(684864)
sce <- harmony::RunHarmony(object = sce,
group.by.vars = c("donor", "experiment"),
reduction = "PCA",
reduction.save = "HARMONY_donor_experiment",
verbose = FALSE)
reducedDim(sce,type="PCA")[1:5,1:2]
## PC1 PC2
## 21554_5#128 -27.328077 9.763073
## 21554_5#142 -26.937387 8.439599
## 21554_5#174 -16.446209 16.527976
## 21554_5#176 -4.001995 15.540162
## 21554_5#181 -22.177901 7.610681
reducedDim(sce,type="HARMONY_donor_experiment")[1:5,1:2]
## HARMONY_donor_experiment_1 HARMONY_donor_experiment_2
## 21554_5#128 -24.885620 9.748495
## 21554_5#142 -24.315864 5.124895
## 21554_5#174 -14.981704 15.381010
## 21554_5#176 7.842832 9.330105
## 21554_5#181 -19.834779 7.479928
ggplot(data = as.data.frame(reducedDim(sce,type="PCA")[,1:2]),
aes(x=PC1,y=PC2)) +
geom_point(aes(colour = as.factor(sce$day))) +
theme_bw()
ggplot(data = as.data.frame(reducedDim(sce,type="HARMONY_donor_experiment")[,1:2]),
aes(x=HARMONY_donor_experiment_1,y=HARMONY_donor_experiment_2)) +
geom_point(aes(colour = as.factor(sce$day))) +
theme_bw()
ggplot(data = as.data.frame(reducedDim(sce,type="PCA")[,1:2]),
aes(x=PC1,y=PC2)) +
geom_point(aes(colour = as.factor(sce$donor))) +
theme_bw() +
theme(legend.title = element_blank()) +
facet_wrap(~as.factor(sce$day), scales="free", ncol=1)
ggplot(data = as.data.frame(reducedDim(sce,type="HARMONY_donor_experiment")[,1:2]),
aes(x=HARMONY_donor_experiment_1, y=HARMONY_donor_experiment_2)) +
geom_point(aes(colour = as.factor(sce$donor))) +
theme_bw() +
theme(legend.title = element_blank()) +
facet_wrap(~as.factor(sce$day), scales="free", ncol=1)
sce <- runTSNE(sce,
dimred = 'HARMONY_donor_experiment',
external_neighbors=TRUE,
name = "TSNE_HARMONY_donor_experiment")
# no batch versus batch corrected, color by day
p1 <- plotReducedDim(sce,
dimred = "TSNE",
colour_by = "day")
p2 <- plotReducedDim(sce,
dimred = "TSNE_HARMONY_donor_experiment",
colour_by = "day")
p1 + p2
# no batch versus batch corrected, color by donor
p3 <- plotReducedDim(sce,
dimred = "TSNE",
colour_by = "donor")
p4 <- plotReducedDim(sce,
dimred = "TSNE_HARMONY_donor_experiment",
colour_by = "donor")
p3 + p4
# no batch versus batch corrected, color by experiment
p5 <- plotReducedDim(sce,
dimred = "TSNE",
colour_by = "experiment")
p6 <- plotReducedDim(sce,
dimred = "TSNE_HARMONY_donor_experiment",
colour_by = "experiment")
p5 + p6
saveRDS(sce, "sce_after_batch.rds")
We may split the process into two more intuitive steps:
1. Compute the pairwise distances between all cells. By default these are Euclidean distances and, to reduce data complexity and increase the signal-to-noise ratio, we may compute them on the top (30) PCs. This is implemented in the dist function.
2. Perform a hierarchical cluster analysis on the distances from step 1. Initially, each cell is assigned to its own cluster; the algorithm then proceeds iteratively, at each stage joining the two most similar clusters, until there is just a single cluster. This is implemented in the hclust function.
Note that the hclust function allows for specifying a “method” argument. The differences between the methods go beyond the scope of this session, but a brief description is provided in the function help file. In the context of scRNA-seq, I have mostly seen the use of the “ward.D2” method.
distsce <- dist(reducedDim(sce, "HARMONY_donor_experiment"))
hcl <- hclust(distsce, method = "ward.D2")
plot(hcl, labels = FALSE)
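To make the two steps more tangible, here is a minimal sketch on a toy matrix of six "cells" in two dimensions. The coordinates and the object names (`toy`, `toy_dist`, `toy_hcl`) are made up for illustration and are not part of the Cuomo data; only `dist` and `hclust` from base R are used.

```r
# Toy data: 6 hypothetical cells in a 2-dimensional reduced space,
# forming two well-separated groups of three cells each
toy <- rbind(c(0, 0), c(0, 1), c(1, 0),
             c(10, 10), c(10, 11), c(11, 10))
rownames(toy) <- paste0("cell", 1:6)

# Step 1: pairwise Euclidean distances between all cells
toy_dist <- dist(toy)

# Step 2: agglomerative clustering; every cell starts as its own cluster
toy_hcl <- hclust(toy_dist, method = "ward.D2")

# Each row of `merge` is one iteration of the algorithm:
# negative entries are single cells, positive entries refer to
# clusters that were formed in an earlier row
toy_hcl$merge

# Heights at which the merges happened; the final merge, which joins
# the two groups, happens at a much larger height than the others
toy_hcl$height
```

With n cells there are always n - 1 merges, which is why the dendrogram of the full dataset is too dense to label and we plot it with `labels = FALSE`.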
Next, we need to “cut the tree”, i.e., choose at which resolution we want to report the (cell-type) clusters. This can be achieved with the cutree function. As input, cutree takes the dendrogram from the hclust function and a threshold for cutting the tree. This is either k, the number of clusters we want to report, or h, the height in the dendrogram at which we want to cut the tree.
clust_hcl_k4 <- cutree(hcl, k = 4)
table(clust_hcl_k4)
## clust_hcl_k4
## 1 2 3 4
## 891 1010 901 904
sce$clust_hcl_k4 <- as.factor(clust_hcl_k4)
plotReducedDim(sce,
dimred = "HARMONY_donor_experiment",
colour_by="clust_hcl_k4")
plotReducedDim(sce,
dimred = "HARMONY_donor_experiment",
colour_by ="day")
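The call above cuts the tree by the number of clusters, k. As a small sketch of the alternative, the toy example below cuts the same dendrogram once with k and once with h and checks that both give the same partition. The toy coordinates, the cut height of 5 and the `toy_day` labels are assumptions chosen for illustration only.

```r
# Toy data: two well-separated groups of three hypothetical cells
toy <- rbind(c(0, 0), c(0, 1), c(1, 0),
             c(10, 10), c(10, 11), c(11, 10))
toy_hcl <- hclust(dist(toy), method = "ward.D2")

# Cut by number of clusters
by_k <- cutree(toy_hcl, k = 2)

# Cut by height: any height between the last two merges
# gives the same two clusters
by_h <- cutree(toy_hcl, h = 5)
all(by_k == by_h)

# Hypothetical known labels, playing the role of `day` in the lab data
toy_day <- c("day0", "day0", "day0", "day3", "day3", "day3")

# Cross-tabulate clusters against the known labels
toy_tab <- table(cluster = by_k, day = toy_day)
toy_tab
```

For the lab data, the analogous cross-tabulation would be `table(sce$clust_hcl_k4, sce$day)`, which complements the visual comparison of the two plots above.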
Wikipedia provides a decent high-level description of trajectory inference:
“Trajectory inference or pseudotemporal ordering is a computational technique used in single-cell transcriptomics to determine the pattern of a dynamic process experienced by cells and then arrange cells based on their progression through the process. […] Trajectory inference seeks to characterize [such] differences by placing cells along a continuous path that represents the evolution of the process rather than dividing cells into discrete clusters. In some methods this is done by projecting cells onto an axis called pseudotime which represents the progression through the process.”
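The idea of "projecting cells onto an axis called pseudotime" can be sketched with simulated data in base R. This is a deliberately crude illustration and not what slingshot does: slingshot builds a minimum spanning tree on clusters and fits principal curves, whereas the sketch below simply projects cells onto the first principal component. All object names and simulation settings are assumptions.

```r
set.seed(1)
n <- 200

# Hypothetical "true" progression of each cell through the process (0 to 1)
true_time <- runif(n)

# Two simulated reduced dimensions that change smoothly with progression,
# plus some noise
emb <- cbind(dim1 = 10 * true_time + rnorm(n, sd = 0.5),
             dim2 = 5 * true_time + rnorm(n, sd = 0.5))

# Project each cell onto the main axis of variation (first principal component)
pc1 <- prcomp(emb)$x[, 1]

# The sign of a principal component is arbitrary, so orient the axis
# such that it increases with dim1 (in practice one would use a known
# starting population instead)
if (cor(pc1, emb[, "dim1"]) < 0) pc1 <- -pc1

# Rescale to [0, 1]: a crude pseudotime
pseudotime_toy <- (pc1 - min(pc1)) / (max(pc1) - min(pc1))

# The inferred ordering should agree closely with the simulated progression
cor(pseudotime_toy, true_time, method = "spearman")
```

A straight axis only works when the process is roughly linear in the reduced space; curved or branching trajectories are the reason dedicated methods exist.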
Here, we will use slingshot to create a trajectory for the Cuomo dataset.
library(slingshot)
## Loading required package: princurve
## Loading required package: TrajectoryUtils
##
## Attaching package: 'TrajectoryUtils'
## The following object is masked from 'package:scran':
##
## createClusterMST
sce <- slingshot(sce,
start.clus = "2",
end.clus = "3",
clusterLabels = "clust_hcl_k4",
reducedDim = "HARMONY_donor_experiment")
plot(reducedDims(sce)$HARMONY_donor_experiment[,c(1,2)],
col = as.factor(sce$clust_hcl_k4),
pch=16,
asp = 1)
lines(SlingshotDataSet(sce),
lwd=2,
type = 'lineages',
col = 'black')
plot(reducedDims(sce)$HARMONY_donor_experiment,
col = as.factor(sce$day),
pch=16,
asp = 1)
lines(SlingshotDataSet(sce),
lwd=2,
type = 'lineages',
col = 'black')
library(tradeSeq)
### Find knots
# We first need to decide on the number of knots. This is done using the
# `evaluateK` function. This takes a little time (about 9 min for me).
set.seed(5)
# the slingshot results are stored in `sce`, so we extract them from there
icMat <- evaluateK(counts = assays(sce)$counts,
sds = SlingshotDataSet(sce),
k = 3:10,
nGenes = 500,
verbose = TRUE)
set.seed(7)
subset_genes <- sample(rownames(sce), 1000, replace = FALSE)
# genes from paper
markers <- c("ENSG00000111704", "ENSG00000164458", "ENSG00000141448")
# make sure the genes from the paper are in there
subset_genes <- c(subset_genes, markers[!markers %in% subset_genes])
#20min for all genes, ±2min30 for 1000 genes
pseudotime <- slingPseudotime(sce, na = FALSE)
cellWeights <- slingCurveWeights(sce)
sce_fit <- fitGAM(counts = assays(sce)$counts[subset_genes,],
pseudotime = pseudotime,
cellWeights = cellWeights,
nknots = 6,
verbose = TRUE)
table(rowData(sce_fit)$tradeSeq$converged)
##
## TRUE
## 1003
# ±20sec
assoRes <- associationTest(sce_fit)
head(assoRes)
## waldStat df pvalue meanLogFC
## ENSG00000203879 2172.646154 5 0.0000000 0.9126307
## ENSG00000169567 577.162526 5 0.0000000 0.1359348
## ENSG00000135926 136.282751 5 0.0000000 0.3622003
## ENSG00000113645 228.810543 5 0.0000000 0.8917382
## ENSG00000151612 9.050327 5 0.1070735 0.1141678
## ENSG00000100325 92.663116 5 0.0000000 0.4171203
sum(p.adjust(assoRes$pvalue, method = "BH") < 0.05, na.rm = TRUE)/nrow(assoRes)
## [1] 0.892323
# roughly 90% of the tested genes are flagged as significant at the 5% FDR level,
# which is a remarkably high fraction
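To see what the proportion above measures, here is a small sketch of the Benjamini-Hochberg adjustment on ten made-up p-values. The values in `pvals` are hypothetical; `p.adjust` is the same base R function used above.

```r
# Hypothetical raw p-values for 10 genes; one test failed and returned NA
pvals <- c(0.0001, 0.0004, 0.002, 0.008, 0.02, 0.04, 0.3, 0.5, 0.7, NA)

# Benjamini-Hochberg adjustment controls the false discovery rate
padj <- p.adjust(pvals, method = "BH")

# Number of genes below 0.05 before and after adjustment
n_raw <- sum(pvals < 0.05, na.rm = TRUE)
n_adj <- sum(padj < 0.05, na.rm = TRUE)
c(raw = n_raw, adjusted = n_adj)

# Fraction of all genes (including the one with NA) declared significant,
# mirroring the division by nrow(assoRes) above
n_adj / length(pvals)
```

Adjusted p-values are never smaller than the raw ones, so the number of genes called significant can only stay the same or decrease.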
startRes <- startVsEndTest(sce_fit)
oStart <- order(startRes$waldStat, decreasing = TRUE)
for (i in 1:5) {
sigGeneStart <- oStart[i] # i-th most significant gene in the start vs. end test
# oStart indexes the rows of sce_fit, so the title must be taken from sce_fit as well
print(plotSmoothers(sce_fit,
assays(sce_fit)$counts,
gene = sigGeneStart) +
ggtitle(rownames(sce_fit)[sigGeneStart]))
}
In the Cuomo paper, the authors highlighted the following genes:
plotSmoothers(sce_fit,
assays(sce_fit)$counts,
gene = which(rownames(sce_fit) == "ENSG00000111704"))
plotSmoothers(sce_fit,
assays(sce_fit)$counts,
gene = which(rownames(sce_fit) == "ENSG00000164458"))
plotSmoothers(sce_fit,
assays(sce_fit)$counts,
gene = which(rownames(sce_fit) == "ENSG00000141448"))
These fitted curves correspond closely with the results presented in the paper.
plotGeneCount(sce$slingshot,
assays(sce_fit)$counts,
gene = which(rownames(sce_fit) == "ENSG00000111704"))
plotGeneCount(sce$slingshot,
assays(sce_fit)$counts,
gene = which(rownames(sce_fit) == "ENSG00000164458"))
plotGeneCount(sce$slingshot,
assays(sce_fit)$counts,
gene = which(rownames(sce_fit) == "ENSG00000141448"))